Stacked Semantics-Guided Attention Model for Fine-Grained Zero-Shot Learning
Zero-Shot Learning (ZSL) is generally achieved by aligning the semantic relationships between visual features and the corresponding class semantic descriptions. However, using global features to represent fine-grained images may lead to sub-optimal results, since global features neglect the discriminative differences among local regions. Moreover, different regions carry distinct discriminative information, so the important regions should contribute more to the prediction. To this end, we propose a novel stacked semantics-guided attention (S2GA) model that obtains semantics-relevant features by using individual class semantic features to progressively guide the visual features and generate an attention map weighting the importance of different local regions. By feeding both the integrated visual features and the class semantic features into a multi-class classification architecture, the proposed framework can be trained end-to-end. Extensive experiments on the CUB and NABird datasets show that the proposed approach yields consistent improvements on both fine-grained zero-shot classification and retrieval tasks.
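The core mechanism described above can be sketched in a few lines of numpy. This is a minimal, hedged illustration of one plausible reading of semantics-guided attention, not the authors' actual implementation: a class semantic vector is projected into the visual space to form a guidance query, each local region is scored against that query, a softmax over regions yields an attention map, and the attended feature refines the query across stacked hops. All dimensions, weight matrices, and the additive refinement step are illustrative assumptions; in a real model the weights would be learned end-to-end.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # numerically stable softmax over region scores
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_hop(V, q, Wv, Wq, w):
    """One attention hop: score each local region by its compatibility
    with the semantic guidance vector q, then pool regions by weight."""
    h = np.tanh(V @ Wv + q @ Wq)   # (n_regions, d_h) joint embedding
    alpha = softmax(h @ w)         # attention map over regions, sums to 1
    return alpha @ V, alpha        # semantics-weighted region feature

# illustrative sizes (assumptions, not from the paper)
d_v, d_s, d_h, n_regions, n_hops = 8, 6, 10, 5, 2

V = rng.standard_normal((n_regions, d_v))  # local region visual features
s = rng.standard_normal(d_s)               # class semantic description

# project the semantic vector into visual space as the initial guidance
Ws = rng.standard_normal((d_s, d_v))
q = s @ Ws

for _ in range(n_hops):  # "stacked" attention: refine guidance per hop
    # random stand-ins for learned per-hop parameters
    Wv = rng.standard_normal((d_v, d_h))
    Wq = rng.standard_normal((d_v, d_h))
    w = rng.standard_normal(d_h)
    v_att, alpha = attention_hop(V, q, Wv, Wq, w)
    q = q + v_att  # fold the attended feature back into the guidance

v_final = q  # semantics-relevant image representation for classification
```

In this sketch the stacking is what makes the attention "progressive": each hop re-scores the regions using a guidance vector already enriched by the previous hop's attended feature.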
Reviews: Stacked Semantics-Guided Attention Model for Fine-Grained Zero-Shot Learning
Summary: This paper presents a stacked semantics-guided attention (S2GA) model for improved zero-shot learning. The main idea is that important regions should contribute more to the prediction. To this end, the authors design an attention method that assigns different weights to different regions according to their relevance to the class semantic features, and integrates both the global visual features and the weighted region features into more semantics-relevant image representations.
Strengths: The method is well motivated. The presentation of the method is clear. Using stacked attention for zero-shot learning appears to be a new idea (I did not check exhaustively).
Yu, Yunlong; Ji, Zhong; Fu, Yanwei; Guo, Jichang; Pang, Yanwei; Zhang, Zhongfei (Mark)